10/25/2018

Goals for this week

  • Finish simple linear regression
  • Analysis of residuals
  • How to report your statistical results
  • Non-linear regression
  • Multiple linear regression
  • Git and GitHub
  • One factor ANOVA
  • Experimental Design (Tuesday)

Non-Linear Regression

Complex non-linear regression

one response and one predictor

  • power
  • exponential
  • polynomial
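
These model families (power, exponential, polynomial) can often be fit with `lm()` after a transformation. A minimal sketch with simulated data — `x`, `y`, and the noise model are assumptions for illustration, not the course dataset:

```r
# Simulated data from a power law with multiplicative (lognormal) noise,
# so that the log-log relationship is exactly linear plus noise.
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 2 * x^1.5 * exp(rnorm(50, sd = 0.1))

# Power model, y = a*x^b: linear on the log-log scale
power_lm <- lm(log(y) ~ log(x))

# Exponential model, y = a*exp(b*x): linear on the semi-log scale
exp_lm <- lm(log(y) ~ x)

# Second-degree polynomial, fit directly
poly_lm <- lm(y ~ poly(x, 2))

coef(power_lm)   # the slope estimates the exponent b (close to 1.5 here)
```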

Multiple Linear Regression

Multiple Linear Regression - Goals

  • To develop a better predictive model than is possible from models based on single independent variables.

  • To investigate the relative individual effects of each of the multiple independent variables above and beyond the effects of the other variables.

  • The individual effects of each of the predictor variables on the response variable can be depicted by single partial regression lines.

  • The slope of any single partial regression line (partial regression slope) thereby represents the rate of change or effect of that specific predictor variable (holding all the other predictor variables constant to their respective mean values) on the response variable.

Multiple Linear Regression

Additive and multiplicative models of 2 or more predictors

Additive model \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_jx_{ij} + \epsilon_i\]

Multiplicative model (with two predictors) \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i1}x_{i2} + \epsilon_i\]
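
In R's formula syntax the additive model uses `+` and the multiplicative model uses `*`, which expands to the main effects plus their interaction term. A sketch with simulated data (variable names are placeholders, not the course data):

```r
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1 + 2 * x1 + 3 * x2 + 0.5 * x1 * x2 + rnorm(100)

add_lm  <- lm(y ~ x1 + x2)   # additive: beta0 + beta1*x1 + beta2*x2
mult_lm <- lm(y ~ x1 * x2)   # expands to x1 + x2 + x1:x2

coef(add_lm)    # three coefficients
coef(mult_lm)   # four coefficients, the last for the x1:x2 interaction
```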

Multiple Linear Regression

Additive and multiplicative models

R INTERLUDE

multiple regression with 2 predictor variables

  • Read in the RNAseq_lip.tsv and examine the continuous variables.
  • We are interested in whether Gene01 and/or Gene02 expression levels influence lipid content.
  • First plot \(Y\) vs \(Gene01\), \(Y\) vs \(Gene02\), then set up and test a multiplicative model.
  • What are the parameter estimates of interest in this case?
  • What are the outcomes from our hypothesis tests?
RNAseq_Data <- read.table('RNAseq_lip.tsv', header=T, sep='\t')

y <- RNAseq_Data$Lipid_Conc
g1 <- RNAseq_Data$Gene01
g2 <- RNAseq_Data$Gene02
g3 <- RNAseq_Data$Gene03
g4 <- RNAseq_Data$Gene04

Mult_lm <- lm(y ~ g1*g2)
summary(Mult_lm)

R INTERLUDE

multiple regression with 2 predictor variables

  • Now get rid of the interaction term, and set up a purely additive model
  • Did any of our estimates change? Why?
  • Did the degrees of freedom change? Why?
Add_lm <- lm(y ~ g1+g2)
summary(Add_lm)
  • Finally try different combinations of genes

R INTERLUDE

Now see if a polynomial fits better

RNAseq_Data <- read.table('RNAseq_lip.tsv', header=T, sep='\t')

y <- RNAseq_Data$Lipid_Conc
g4 <- RNAseq_Data$Gene04

RNAseq_lm_linear <- lm(y ~ g4)
summary(RNAseq_lm_linear)
influence.measures(RNAseq_lm_linear)

RNAseq_lm_poly <- lm(y ~ poly(g4, 2))
summary(RNAseq_lm_poly)

plot(g4, y)
ord <- order(g4)   # sort by the predictor so the fitted line draws cleanly
lines(g4[ord], fitted(RNAseq_lm_poly)[ord], col="blue")
influence.measures(RNAseq_lm_poly)
  • Try this for different genes
  • Try this for different degree polynomials
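
Nested polynomial fits of increasing degree can be compared with sequential F-tests via `anova()`. A sketch with simulated data is below; with the course data, substitute `y` and `g4` (the simulated `x` and `y` here are assumptions):

```r
set.seed(7)
x <- runif(80, 0, 5)
y <- 1 + 0.5 * x - 0.3 * x^2 + rnorm(80, sd = 0.5)

fit1 <- lm(y ~ x)             # straight line
fit2 <- lm(y ~ poly(x, 2))    # quadratic
fit3 <- lm(y ~ poly(x, 3))    # cubic

# Sequential F-tests: does each added degree significantly improve the fit?
anova(fit1, fit2, fit3)
```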

Multiple linear regression assumptions

  • linearity
  • normality
  • homogeneity of variance
  • no multi-collinearity - a predictor variable must not be strongly correlated with the combination of the other predictor variables.

checking for multi-collinearity

library(car)
scatterplotMatrix(~var1+var2+var3, diag="boxplot")

R INTERLUDE

multiple regression with three variables

RNAseq3_lm <- lm(y ~ g1+g2+g3)
summary(RNAseq3_lm)
plot(RNAseq3_lm)
library(car)
scatterplotMatrix(~g1+g2+g3, diag="boxplot")
scatterplotMatrix(~y+g1+g2+g3, diag="boxplot")

To get tolerance (1-\(R^2\)) calculations

1/vif(lm(y ~ g1 + g2 + g3))
  • Is there any evidence for multi-collinearity?
  • Try running a simple linear regression for y and Gene03 (y ~ g3).
  • What is the slope estimate for Gene03, and how does it compare to the partial slope estimate above?
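
The tolerance/VIF arithmetic can be made concrete by computing it by hand on simulated predictors (a sketch; the collinear structure below is an assumption for illustration). `car::vif()` reports the same per-term quantity computed manually here:

```r
set.seed(3)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # x2 nearly duplicates x1
x3 <- rnorm(100)

# VIF for x1 = 1 / (1 - R^2) from regressing x1 on the other predictors
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
tol_x1 <- 1 / vif_x1              # tolerance, as in the slide's 1/vif() call

vif_x1   # large, because x2 carries almost the same information as x1
```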

Model selection

the problems

  • How to decide the complexity of polynomial: straight line regression, quadratic, cubic, ….

  • Which variables to keep/ discard when building a multiple regression model?

  • Selecting from candidate models representing different biological processes.

Model selection

a beetle example

Start with linear regression

Quadratic (2nd degree) polynomial?

Cubic (3rd degree) polynomial?

A polynomial of degree 5?

A polynomial of degree 10?

The problem with this approach

  • The log likelihood of the model increases with the number of parameters
  • So does \(r^2\)
  • Isn’t this good - the best fit to the data?

The problem with this approach

  • It violates the principle of parsimony:
  • Fit no more parameters than is necessary.
  • If two or more models fit the data almost equally well, prefer the simpler model.

“models should be pared down until they are minimal and adequate”

Crawley 2007, p325

Let’s consider our objectives

  • The model should predict well

  • Approximates true relationship between the variables

  • We should be able to evaluate a wide array of candidate models, not only “full” versus “reduced” versions of a single model.

  • NOTE: Reduced vs. full models are referred to as nested models. Non-subset models are called non-nested models.

  • Don’t confuse with nested experimental designs or sampling designs.

How do we accomplish these goals?

To answer this, we need:

  • A criterion to compare models:
    • Mallow’s Cp
    • AIC (Akaike’s Information Criterion)
    • BIC (Bayesian Information Criterion)
  • A strategy for searching the candidate models

How do we accomplish these goals?

  • How to decide the complexity of polynomial: straight line regression, quadratic, cubic, ….

  • Which variables to keep/ discard when building a multiple regression model?

  • Selecting from candidate models representing different biological processes.

  • \(MS_{residuals}\) - represents the mean amount of variation unexplained by the model; the lowest value indicates the best fit.

  • Adjusted \(r^2\) - the proportion of variation in the response variable explained by the model, adjusted for both sample size and the number of model terms. Larger values indicate a better fit.

  • Mallow’s Cp - an index comparing a specific model to the model that contains all possible terms. Models with the lowest value and/or values closest to their respective p (the number of model terms, including the y-intercept) indicate the best fit.

  • Akaike Information Criterion (AIC) - there are several versions of AIC, each adding a different constant (designed to penalize for the number of parameters and sample size) to a likelihood function, producing a relative measure of the information content of a model. Smaller values indicate more parsimonious models. As a rule of thumb, if the difference between two AIC values (delta AIC) is greater than 2, the model with the lower AIC is a meaningful improvement in parsimony.

  • Schwarz or Bayesian Information Criterion (BIC or SIC) - outwardly similar to AIC, but the constant added to the likelihood function penalizes models with more predictor terms more heavily (and thus selects simpler models) than AIC does. For this reason BIC is favored by many researchers.
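
`AIC()` and `BIC()` are built into R. A sketch with simulated data (the data-generating model below is an assumption for illustration):

```r
set.seed(9)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

m1 <- lm(y ~ x)               # the true (simpler) model
m2 <- lm(y ~ poly(x, 4))      # over-parameterized competitor

AIC(m1, m2)                   # smaller = more parsimonious
BIC(m1, m2)                   # penalizes the extra terms more heavily
AIC(m2) - AIC(m1)             # delta AIC; a gap > 2 is considered meaningful
```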

Mallow’s Cp

  • Frequently used in multiple regression
  • Test case: “all subsets regression”
  • Strategy: test all possible models and select the one with the smallest Cp
  • The leaps package in R performs a smart search among a potentially huge number of models
  • Typically we are modeling observational data; experimental data often allow a more deliberate choice of variables
  • When testing all possible subsets, the only real analytical decision left is the initial choice of candidate variables
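
Mallow’s Cp can also be computed by hand for each candidate subset, which shows what leaps is searching over. A sketch with simulated data (with the course data, substitute `y` and the gene variables):

```r
set.seed(11)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)     # x3 is irrelevant

full     <- lm(y ~ x1 + x2 + x3)
mse_full <- sum(resid(full)^2) / full$df.residual

# Cp = SSE_p / MSE_full - n + 2p, with p = number of terms incl. intercept
cp <- function(fit) {
  p <- length(coef(fit))
  sum(resid(fit)^2) / mse_full - n + 2 * p
}

cp(lm(y ~ x1))            # inflated: omits x2
cp(lm(y ~ x1 + x2))       # near its p of 3: a good candidate
cp(full)                  # equals its p of 4 by construction
```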

For prediction: All models with Cp < p predict about equally well. Don’t get carried away with a “best”.

For explanation: If numerous equally well fitting models fit the data, it is difficult to deduce which predictor “explains” the response.

General caveat: “regression is not causation”. Experiment needed to get to causal explanation.

ANOVA

ANOVA

  • Stands for ANalysis of VAriance
  • Core statistical procedure in biology
  • Developed by R.A. Fisher in the early 20th Century
  • The core idea is to ask how much variation exists within vs. among groups
  • ANOVAs are linear models that have categorical predictor and continuous response variables
  • The categorical predictors are often called Factors, and can have two or more levels (important to specify in R)
  • Each factor will have a hypothesis test
  • The levels of each factor may also need to be tested
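
Since the factor and its levels matter in R, a small sketch of making a categorical predictor an explicit factor with a specified level order (the level names below are placeholders):

```r
treatment <- c("control", "low", "high", "control", "high")
treatment <- factor(treatment, levels = c("control", "low", "high"))

levels(treatment)    # the levels, in the order specified
nlevels(treatment)   # 3; note a two-level one-way ANOVA is equivalent to a t-test
```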

ANOVA

let’s start with an example

  • Percent time that male mice experiencing discomfort spent “stretching”.

  • Data are from an experiment in which mice experiencing mild discomfort (result of injection of 0.9% acetic acid into the abdomen) were kept in:

    • isolation,
    • with a companion mouse not injected, or
    • with a companion mouse also injected and exhibiting “stretching” behaviors associated with discomfort.
  • The results suggest that mice stretch the most when a companion mouse is also experiencing mild discomfort. Mice experiencing pain appear to “empathize” with co-housed mice also in pain.

From Langford, D. J., et al. 2006. Science 312: 1967-1970

ANOVA

let’s start with an example

In words: stretching = intercept + treatment




The model statement includes a response variable, a constant, and an explanatory variable.

The only difference with regression is that here the explanatory variable is categorical.

ANOVA

conceptually similar to regression

ANOVA

F-ratio calculation

R INTERLUDE

One way ANOVA

  • Use the RNAseq_lip.tsv data again.
  • Let’s test for an effect of Population on Gene01 expression levels
  • First, let’s look at how the data are distributed
RNAseq_Data <- read.table('RNAseq_lip.tsv', header=T, sep='\t')

g1 <- RNAseq_Data$Gene01
Pop <- RNAseq_Data$Population

boxplot(g1~Pop, col=c("blue","green"))

#Or, to plot all points:
stripchart(g1~Pop, vertical=T, pch=19, col=c("blue","green"),
           at=c(1.25,1.75), method="jitter", jitter=0.05)

Pop_Anova <- aov(g1 ~ Pop)
summary(Pop_Anova)

ANOVA

Simple and more complex

  • One-way ANOVAs just have a single factor
  • Multi-factor ANOVAs can be quite complex
    • Factorial - two or more factors and their interactions
    • Nested - the levels of one factor are contained within another level
  • ANOVAs use an F-statistic to test factors in a model
    • ratio of two variances (numerator and denominator)
    • the numerator and denominator d.f. need to be included (e.g. \(F_{1, 34} = 29.43\))
  • Determining the appropriate test ratios for complex ANOVAs takes some work
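
The F-ratio is simply MS(among groups) / MS(within groups), each mean square being a sum of squares divided by its degrees of freedom. A sketch verifying this against `aov()` on simulated two-group data (the data are an assumption for illustration):

```r
set.seed(5)
grp <- factor(rep(c("A", "B"), each = 20))
y   <- c(rnorm(20, mean = 10), rnorm(20, mean = 12))

fit <- aov(y ~ grp)
tab <- summary(fit)[[1]]

# Mean squares are sums of squares divided by their degrees of freedom
ms_among  <- tab[1, "Sum Sq"] / tab[1, "Df"]
ms_within <- tab[2, "Sum Sq"] / tab[2, "Df"]
F_hand    <- ms_among / ms_within

all.equal(F_hand, tab[1, "F value"])   # TRUE
```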

ANOVA

Assumptions

  • Normally distributed groups (robust to non-normality if equal variances and Ns)
  • Equal variances across groups (okay if the largest-to-smallest variance ratio is < 3:1; problematic if there is a mean-variance relationship among groups)
  • Observations in a group are independent (randomly selected; don’t confound group with another factor)

ANOVA

Fixed effects

  • Groups are predetermined, of direct interest, repeatable. For example:
    • medical treatments in a clinical trial
    • predetermined doses of a toxin
    • age groups in a population
    • habitat, season, etc.
  • Any conclusions reached in the study about differences among groups can be applied only to the groups included in the study. The results cannot be generalized to other treatments, habitats, etc. not included in the study.

ANOVA

Random effects

  • Measurements that come in groups. A group can be:
    • a family made up of siblings
    • a subject measured repeatedly
    • a transect of quadrats in a sampling survey
    • a block of an experiment done at a given time
  • Groups are assumed to be randomly sampled from a population of groups.
  • Therefore, conclusions reached about groups can be generalized to the population of groups.
  • With random effects, the variance among groups is the main quantity of interest, not the specific group attributes.

ANOVA

Random effects

  • Whenever your sampling design is nested (quadrats within transects; transects within woodlots; woodlots within districts).
  • Whenever your replicates are grouped spatially or temporally (i.e., in blocks, which are typically analyzed as random effects).
  • Whenever you divide up plots, and apply separate treatments to subplots.
  • Whenever you take measurements on related individuals.
  • Whenever you measure subjects or other sampling units repeatedly.

ANOVA

Random effects - test your understanding

  • Factor is sex (Male vs. Female)

  • Factor is fish tank (10 tanks in an experiment)

  • Factor is family (measure multiple sibs per family)

  • Factor is temperature (10 arbitrary temps over nat. range)

ANOVA

Caution about fixed vs. random effects

  • Most statistical packages assume that all factors are fixed unless you instruct them otherwise.
  • Designating factors as random takes extra work and probably a read of the manual.
  • In R, lm assumes that all effects are fixed.
  • For random effects, use lme instead (part of the nlme package).
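
A sketch of the `lme()` syntax from the nlme package, fitting a random intercept per group on simulated grouped data (“tank” as the grouping factor is an assumption echoing the earlier example):

```r
library(nlme)

set.seed(8)
d <- data.frame(
  tank = factor(rep(1:10, each = 5)),
  y    = rnorm(50) + rep(rnorm(10), each = 5)   # adds a tank-level effect
)

# Fixed part is just the grand mean; random intercept for each tank
fit_lmm <- lme(y ~ 1, random = ~ 1 | tank, data = d)
VarCorr(fit_lmm)   # variance among tanks vs. residual variance
```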

Git and GitHub